An error is better than a surprise
Recall our both_na() function from Chapter 2, which counts the entries where the vectors x and y both have missing values:
both_na <- function(x, y) {
sum(is.na(x) & is.na(y))
}
We had an example where the behavior was a little surprising:
x <- c(NA, NA, NA)
y <- c( 1, NA, NA, NA)
both_na(x, y)
## Warning in is.na(x) & is.na(y): longer object length is not a multiple
## of shorter object length
## [1] 3
The function works and returns 3 because R silently recycles the shorter vector x, but we certainly didn’t design this function with the idea that people could pass in arguments of different lengths.
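To see exactly where the surprise comes from, we can pull the comparison apart (a quick sketch using the same x and y as above):

```r
x <- c(NA, NA, NA)
y <- c(1, NA, NA, NA)

is.na(x)  # TRUE TRUE TRUE (length 3)
is.na(y)  # FALSE TRUE TRUE TRUE (length 4)

# Inside `&`, the shorter vector is recycled to TRUE TRUE TRUE TRUE,
# so the element-wise AND is FALSE TRUE TRUE TRUE and the sum is 3
# (with a recycling warning, since 4 is not a multiple of 3)
sum(is.na(x) & is.na(y))  # 3
```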
Using stopifnot() is a quick way to have your function stop if a condition isn’t met. stopifnot() takes logical expressions as arguments, and if any of them evaluates to FALSE an error occurs.
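stopifnot() can check several conditions in one call, stopping at the first one that isn’t TRUE. A minimal sketch (safe_mean() is a hypothetical helper, not from the chapter):

```r
# Each argument to stopifnot() must evaluate to TRUE;
# otherwise execution stops with an auto-generated message
safe_mean <- function(x) {
  stopifnot(is.numeric(x), length(x) > 0)
  mean(x)
}

safe_mean(c(1, 2, 3))  # 2
# safe_mean("a")       # Error: is.numeric(x) is not TRUE
```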
# Define troublesome x and y
x <- c(NA, NA, NA)
y <- c( 1, NA, NA, NA)
both_na <- function(x, y) {
# Add stopifnot() to check length of x and y
stopifnot(length(x) == length(y))
sum(is.na(x) & is.na(y))
}
# Call both_na() on x and y
#both_na(x, y)
An informative error is even better
Using stop() instead of stopifnot() allows you to specify a more informative error message. Recall the general pattern for using stop() is:
if (condition) {
stop("Error", call. = FALSE)
}
Writing good error messages is an important part of writing a good function! We recommend that your error message tell the user what should be true, not what is false. For example, a good error here would be “x and y must have the same length”, rather than the bad error “x and y don’t have the same length”.
Let’s use this pattern to write a better check for the length of x and y.
# Define troublesome x and y
x <- c(NA, NA, NA)
y <- c( 1, NA, NA, NA)
both_na <- function(x, y) {
# Replace condition with logical
if (length(x) != length(y)) {
# Replace "Error" with better message
stop("x and y must have the same length", call. = FALSE)
}
sum(is.na(x) & is.na(y))
}
# Call both_na()
#both_na(x, y)
A different kind of surprise: side effects
Side effects are the things a function does that alter the state of your R session beyond returning a value. If foo() is a function with no side effects (a.k.a. pure), then when we run x <- foo(), the only change we expect is that the variable x now has a new value. No other variables in the global environment should be changed or created, no output should be printed, no plots displayed, no files saved, no options changed. We know exactly what changed in the state of the session just by reading the call to the function.
Can you identify which of these functions doesn’t have side effects?
show_missings <- function(x) {
n <- sum(is.na(x))
cat("Missing values: ", n, "\n", sep = "")
x
}
replace_missings <- function(x, replacement) {
x[is.na(x)] <- replacement
x
}
plot_missings <- function(x) {
plot(seq_along(x), is.na(x))
x
}
exclude_missings <- function() {
options(na.action = "na.exclude")
}
Answer: replace_missings().
Of course, functions with side effects are crucial for data analysis. You need to be aware of them, and be deliberate in their use. It’s fine to use them if the side effect is desired, but don’t surprise users with unexpected side effects.
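One common way to be deliberate is to undo a side effect before your function returns. A hypothetical sketch (with_more_digits() is not from the chapter): if a function must change a global option, pairing the change with on.exit() guarantees the old value is restored, even if an error occurs part-way through.

```r
with_more_digits <- function(code) {
  old <- options(digits = 15)  # side effect: change a global option
  on.exit(options(old))        # guaranteed cleanup when the function exits
  force(code)                  # evaluate the caller's expression
}

with_more_digits(print(pi))    # prints pi with 15 significant digits
# afterwards, getOption("digits") is back to its previous value
```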
sapply is another common culprit
sapply() is another common offender with unstable return types: the type of its output depends on the type of its input.
Consider the following data frame and two calls to sapply():
df <- data.frame(
a = 1L,
b = 1.5,
y = Sys.time(),
z = ordered(1)
)
A <- sapply(df[1:4], class)
B <- sapply(df[3:4], class)
str(A)
## List of 4
## $ a: chr "integer"
## $ b: chr "numeric"
## $ y: chr [1:2] "POSIXct" "POSIXt"
## $ z: chr [1:2] "ordered" "factor"
str(B)
## chr [1:2, 1:2] "POSIXct" "POSIXt" "ordered" "factor"
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:2] "y" "z"
What type of objects will A and B be?
A will be a list; B will be a character matrix.
Using purrr solves the problem
This unpredictable behaviour is a sign that you shouldn’t rely on sapply() inside your own functions.
So, what do you do? Use alternate functions that are type consistent! And you already know a whole set: the map() functions in purrr.
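As a quick illustration of the suffix convention (assuming purrr is installed): each suffixed variant promises exactly one output type or throws an error, and plain map() always returns a list.

```r
library(purrr)

# Each map_*() variant always returns the type in its name, or errors
map_dbl(1:3, ~ .x / 2)   # always a double vector
map_lgl(1:3, ~ .x > 1)   # always a logical vector
map(1:3, ~ .x / 2)       # always a list, no matter the input
```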
In this example, when we call class() on the columns of the data frame we are expecting character output, so our function of choice should be: map_chr():
library(purrr)
df <- data.frame(
a = 1L,
b = 1.5,
y = Sys.time(),
z = ordered(1)
)
A <- map_chr(df[1:4], class)
B <- map_chr(df[3:4], class)
str(A)
str(B)
Except that gives us an error: Error: Result 3 must be a single string, not a character vector of length 2. This is a good thing! It alerts us that our assumption (that class() always returns a single string) was wrong.
Let’s look at a couple of solutions. First, we could use map() instead of map_chr(). Our result will always be a list, no matter the input.
library(purrr)
# sapply calls
A <- sapply(df[1:4], class)
B <- sapply(df[3:4], class)
C <- sapply(df[1:2], class)
# Demonstrate type inconsistency
str(A)
## List of 4
## $ a: chr "integer"
## $ b: chr "numeric"
## $ y: chr [1:2] "POSIXct" "POSIXt"
## $ z: chr [1:2] "ordered" "factor"
str(B)
## chr [1:2, 1:2] "POSIXct" "POSIXt" "ordered" "factor"
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:2] "y" "z"
str(C)
## Named chr [1:2] "integer" "numeric"
## - attr(*, "names")= chr [1:2] "a" "b"
# Use map() to define X, Y and Z
X <- map(df[1:4], class)
Y <- map(df[3:4], class)
Z <- map(df[1:2], class)
# Use str() to check type consistency
str(X)
## List of 4
## $ a: chr "integer"
## $ b: chr "numeric"
## $ y: chr [1:2] "POSIXct" "POSIXt"
## $ z: chr [1:2] "ordered" "factor"
str(Y)
## List of 2
## $ y: chr [1:2] "POSIXct" "POSIXt"
## $ z: chr [1:2] "ordered" "factor"
str(Z)
## List of 2
## $ a: chr "integer"
## $ b: chr "numeric"
A type consistent solution
If we wrap our solution into a function, we can be confident that this function will always return a list because we’ve used a type consistent function, map():
col_classes <- function(df) {
map(df, class)
}
But what if you wanted this function to always return a character vector?
One option would be to decide what should happen if class() returns something longer than length 1. For example, we might simply take the first element of the vector returned by class().
col_classes <- function(df) {
# Assign list output to class_list
class_list <- map(df, class)
# Use map_chr() to extract first element in class_list
map_chr(class_list, 1)
}
# Check that our new function is type consistent
df %>% col_classes() %>% str()
## Named chr [1:4] "integer" "numeric" "POSIXct" "ordered"
## - attr(*, "names")= chr [1:4] "a" "b" "y" "z"
df[3:4] %>% col_classes() %>% str()
## Named chr [1:2] "POSIXct" "ordered"
## - attr(*, "names")= chr [1:2] "y" "z"
df[1:2] %>% col_classes() %>% str()
## Named chr [1:2] "integer" "numeric"
## - attr(*, "names")= chr [1:2] "a" "b"
Or fail early if something goes wrong
Another option would be to simply fail. We could rely on map_chr()’s type consistency to fail for us:
col_classes <- function(df) {
map_chr(df, class)
}
df %>% col_classes() %>% str()
Or, check the condition ourselves and return an informative error message. We’ll implement this approach in this exercise.
As you write more functions, you’ll often come across a tension between making a function do something sensible when something surprising happens, and making it simply fail. Our recommendation is to fail when you are writing functions that you’ll use behind the scenes for programming, and to do something sensible when writing functions for users to use interactively.
(And by the way, flatten_chr() is yet another useful function in purrr. It takes a list and removes one level of hierarchy. The suffix _chr indicates that it is another type consistent function: it will either return a character vector or throw an error.)
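In miniature (assuming purrr is installed), flatten_chr() behaves like this:

```r
library(purrr)

# flatten_chr() removes one level of hierarchy and insists the
# result is character -- any other element type is an error
flatten_chr(list("a", c("b", "c")))  # c("a", "b", "c")
# flatten_chr(list(1, 2))            # error: elements are not character
```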
col_classes <- function(df) {
class_list <- map(df, class)
# Add a check that no element of class_list has length > 1
if (any(map_dbl(class_list, length) > 1)) {
stop("Some columns have more than one class", call. = FALSE)
}
# Use flatten_chr() to return a character vector
flatten_chr(class_list)
}
# Check that our new function is type consistent
#df %>% col_classes() %>% str()
#df[3:4] %>% col_classes() %>% str()
#df[1:2] %>% col_classes() %>% str()
Programming with NSE functions
Let’s take a look at a function that uses the non-standard evaluation (NSE) function filter() from the dplyr package:
big_x <- function(df, threshold) {
dplyr::filter(df, x > threshold)
}
This big_x() function attempts to return all rows in df where the x column exceeds a certain threshold. Let’s get a feel for how it might be used.
# diamonds comes with the ggplot2 package
library(ggplot2)
diamonds_sub <- diamonds[1:1000, ]
# Use big_x() to find rows in diamonds_sub where x > 7
big_x(diamonds_sub, 7)
## # A tibble: 1 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1.27 Premium H SI2 59.3 61 2845 7.12 7.05 4.2
When things go wrong
Now, let’s see how this function might fail. There are two instances in which the non-standard evaluation of filter() could cause surprising results:
1. The x column doesn’t exist in df.
2. There is a threshold column in df.
Let’s illustrate these failures. In each case we’ll use big_x() in the same way as the previous exercise, so we should expect the same output. However, not only do we get unexpected outputs, there is no indication (i.e. error message) that lets us know something might have gone wrong.
# Remove the x column from diamonds
diamonds_sub$x <- NULL
# Create variable x with value 1
x <- 1
# Use big_x() to find rows in diamonds_sub where x > 7
big_x(diamonds_sub, 7)
## # A tibble: 0 x 9
## # ... with 9 variables: carat <dbl>, cut <ord>, color <ord>,
## # clarity <ord>, depth <dbl>, table <dbl>, price <int>, y <dbl>, z <dbl>
# Create a threshold column with value 100
diamonds_sub$threshold <- 100
# Use big_x() to find rows in diamonds_sub where x > 7
big_x(diamonds_sub, 7)
## # A tibble: 0 x 10
## # ... with 10 variables: carat <dbl>, cut <ord>, color <ord>,
## # clarity <ord>, depth <dbl>, table <dbl>, price <int>, y <dbl>,
## # z <dbl>, threshold <dbl>
Instead of failing with an error or warning, big_x() gave an incorrect answer. This is dangerous!
What to do?
To avoid the problems caused by non-standard evaluation functions, you could avoid using them. In our example, we could achieve the same results by using standard subsetting (i.e. []) instead of filter(). For more insight into dealing with NSE and how to write your own non-standard evaluation functions, we recommend reading Hadley’s vignette on the topic. Also, programming with the NSE functions in dplyr will be easier in a future version.
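As a sketch of the standard-subsetting alternative: df[["x"]] can only mean the column, and threshold can only mean the function argument, so a stray threshold column in df is harmless. A missing x column still deserves an explicit check (the same one the chapter uses below), because df[["x"]] would otherwise silently return NULL.

```r
big_x <- function(df, threshold) {
  if (!"x" %in% names(df)) {
    stop("df must contain variable called x", call. = FALSE)
  }
  # Standard evaluation: no ambiguity about where x and threshold live.
  # Note `[` keeps rows where the condition is NA (unlike filter());
  # that doesn't matter here if x has no missing values.
  df[df[["x"]] > threshold, , drop = FALSE]
}
```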
If you do need to use non-standard evaluation functions, it’s up to you to provide protection against the problem cases. That means you need to know what the problem cases are, to check for them, and to fail explicitly.
To see what that might look like, let’s rewrite big_x() to fail for our problem cases.
big_x <- function(df, threshold) {
# Write a check for x not being in df
if (!"x" %in% colnames(df)) {
stop("df must contain variable called x", call. = FALSE)
}
# Write a check for threshold being in df
if ("threshold" %in% colnames(df)) {
stop("df must not contain variable called threshold", call. = FALSE)
}
dplyr::filter(df, x > threshold)
}
Legitimate use of options
In general, you want to avoid having the return value of your own functions depend on any global options. That way, you and others can reason about your functions without needing to know the current state of the options.
It is, however, okay to have side effects of a function depend on global options. For example, the print() function uses getOption("digits") as the default for the digits argument. This gives users some control over how results are displayed, but doesn’t change the underlying computation.
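In miniature, the same idea: the stored value never changes, only how print() displays it.

```r
x <- 1 / 3
old <- options(digits = 3)  # only affects display, not computation
print(x)                    # 0.333
options(old)                # restore the previous option
identical(x, 1 / 3)         # TRUE: the underlying value is unchanged
```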
Let’s take a look at an example function that uses a global default sensibly. The print.lm() method has a digits argument with default max(3, getOption("digits") - 3).
# Start with this
options(digits = 8)
# Fit a regression model
fit <- lm(mpg ~ wt, data = mtcars)
# Look at the summary of the model
summary(fit)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5431 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2.2e-16 ***
## wt -5.3445 0.5591 -9.559 1.294e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.0459 on 30 degrees of freedom
## Multiple R-squared: 0.75283, Adjusted R-squared: 0.74459
## F-statistic: 91.375 on 1 and 30 DF, p-value: 1.294e-10
# Set the global digits option to 2
options(digits = 2)
# Take another look at the summary
summary(fit)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.543 -2.365 -0.125 1.410 6.873
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.285 1.878 19.86 < 2e-16 ***
## wt -5.344 0.559 -9.56 1.3e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3 on 30 degrees of freedom
## Multiple R-squared: 0.753, Adjusted R-squared: 0.745
## F-statistic: 91.4 on 1 and 30 DF, p-value: 1.29e-10
Session Info
sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=German_Switzerland.1252 LC_CTYPE=German_Switzerland.1252
## [3] LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C
## [5] LC_TIME=German_Switzerland.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] purrr_0.3.0 ggplot2_3.1.0 dplyr_0.8.0.1 gapminder_0.3.0
## [5] kableExtra_1.0.1 knitr_1.21
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.0 plyr_1.8.4 pillar_1.3.1
## [4] compiler_3.5.2 prettydoc_0.2.1 tools_3.5.2
## [7] digest_0.6.18 gtable_0.2.0 evaluate_0.12
## [10] tibble_2.0.1 viridisLite_0.3.0 pkgconfig_2.0.2
## [13] rlang_0.3.1 cli_1.0.1 rstudioapi_0.9.0
## [16] yaml_2.2.0 xfun_0.4 withr_2.1.2
## [19] httr_1.4.0 stringr_1.4.0 xml2_1.2.0
## [22] hms_0.4.2 webshot_0.5.1 grid_3.5.2
## [25] tidyselect_0.2.5 glue_1.3.0 R6_2.4.0
## [28] fansi_0.4.0 rmarkdown_1.11 readr_1.3.1
## [31] magrittr_1.5 scales_1.0.0 htmltools_0.3.6
## [34] assertthat_0.2.0 rvest_0.3.2 colorspace_1.4-0
## [37] utf8_1.1.4 stringi_1.3.1 lazyeval_0.2.1
## [40] munsell_0.5.0 crayon_1.3.4